import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import folium
import plotly.express as px
import plotly.graph_objects as go
import branca.colormap as cm
This project explores wind disaster risk across districts in Sleman Regency using clustering techniques. The analysis aims to identify high-risk districts, detect unusual patterns, and provide actionable insights for disaster mitigation.
Objectives:
sleman = pd.read_csv("DATA FIX.csv")
These are the elements of the dataset:
sleman.keys()
Index(['Kecamatan', 'Tahun', 'BanyakKejadian ', 'PohonTumbang ',
'JaringanListrik '],
dtype='object')
Remove extra spaces from column names to ensure consistency.
sleman.columns = sleman.columns.str.strip()
The dataset has 34 entries and 5 columns:
sleman.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 34 entries, 0 to 33 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Kecamatan 34 non-null object 1 Tahun 34 non-null int64 2 BanyakKejadian 34 non-null int64 3 PohonTumbang 34 non-null int64 4 JaringanListrik 34 non-null int64 dtypes: int64(4), object(1) memory usage: 1.5+ KB
print(sleman.isnull().sum())
Kecamatan 0 Tahun 0 BanyakKejadian 0 PohonTumbang 0 JaringanListrik 0 dtype: int64
The dataset was cleaned and verified with no missing values.
The Kecamatan shapefile was imported and converted to standard CRS (EPSG:4326).
kecamatan = gpd.read_file("shp sleman.shp").to_crs(epsg=4326)
We focus on key features representing the impact of wind disasters:
features = ['BanyakKejadian','PohonTumbang','JaringanListrik']
We visualize the distribution of disaster-related features across to identify extreme values (hotspots).
# Convert Data
df_melt = sleman.melt(value_vars=features, var_name='Feature', value_name='Value')
#Create boxplot
fig = px.box(df_melt, x='Feature', y='Value', color='Feature', title="Distribution Feature")
fig.update_layout(showlegend=False)
fig.show()
Number of incidents and fallen trees show high outliers, indicating specific districts face very high risk. These are the primary wind disaster hotspots.
Before clustering, it's useful to understand how the features relate to each other.
plt.figure(figsize=(8,6))
sns.heatmap(sleman[features].corr(), annot=True, cmap="RdBu_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap"); plt.show()
From the heat map, we can see that fallen trees strongly correlate with power network disruption, highlighting them as a key factor amplifying disaster impact. Total incidents show moderate correlation with other features.
To prepare the data for clustering, we need all features to contribute equally. Since the features have different scales, we scale them to a 0-1 range.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(sleman[features])
from sklearn.cluster import KMeans
To identify the most suitable number of clusters, two complementary approaches were applied:
The Elbow Method evaluates the Within-Cluster Sum of Squares (WSS) for different values of k. The "elbow" point indicates the optimal cluster count.
wss = [KMeans(n_clusters=i, random_state=42, n_init=10).fit(X_scaled).inertia_ for i in range(1,10)]
# Create plot
plt.plot(range(1,10), wss, marker='o')
plt.xlabel("k"); plt.ylabel("WSS"); plt.title("Elbow Method"); plt.show()
Silhouette Analysis measures how well each data point fits within its assigned cluster. A higher Silhouette Score indicates better-defined clusters.
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import SilhouetteVisualizer
# Range k cluster
k_values = [2, 3, 4, 5]
for k in k_values:
model = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = model.fit_predict(X_scaled)
# Create silhouette score
score = silhouette_score(X_scaled, labels)
# Create plot
fig, ax = plt.subplots(figsize=(6,4))
SilhouetteVisualizer(model, ax=ax).fit(X_scaled).finalize()
plt.title(f"Silhouette k={k} (Score={score:.3f})", fontsize=11)
plt.show()
Elbow and Silhouette analyses suggest 3 clusters as the optimal choice.
kmeans = KMeans(n_clusters=3, random_state=123, n_init=10)
sleman['KMeans_Cluster'] = kmeans.fit_predict(X_scaled)
Calculated the centroids of each cluster and summed their feature values.
# Calculate centroid
centroids = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=features)
centroids['Total'] = centroids.sum(axis=1)
# Assign risk levels
risk_order = ['High','Moderate','Low'] # highest Total = High Risk
centroids_sorted = centroids.sort_values('Total', ascending=False).reset_index()
cluster_to_risk = {row['index']: risk_order[i] for i,row in centroids_sorted.iterrows()}
sleman['KMeans_Risk'] = sleman['KMeans_Cluster'].map(cluster_to_risk)
fig = px.scatter_3d(sleman, x='BanyakKejadian', y='PohonTumbang', z='JaringanListrik',
color='KMeans_Risk', symbol='KMeans_Risk', hover_data=['Kecamatan', 'Tahun'])
colors = {'High':'red','Moderate':'green','Low':'blue'}
for i,row in centroids_sorted.iterrows():
fig.add_trace(go.Scatter3d(x=[row['BanyakKejadian']], y=[row['PohonTumbang']], z=[row['JaringanListrik']],
mode='markers', marker=dict(size=12, color=colors[risk_order[i]], symbol='x'),
name=f"{risk_order[i]} Centroid", showlegend=False))
fig.update_layout(title="K-Means Clusters 3D", legend_title_text='Risk Level')
fig.show()
Insight:
High Risk - "Red zone"
Districts like Cangkringan stand out with high incidents, severe tree falls, and frequent power outages the top priority for mitigation.
Moderate Risk – “Green zone”
Mid-level points with mixed patterns, e.g., high tree falls but low outages, signaling potential escalation if not addressed.
Low Risk – “Blue zone”
Close to the origin with minimal events, these districts remain relatively safe but still require monitoring.
from sklearn.cluster import DBSCAN
eps¶from sklearn.neighbors import NearestNeighbors
# Set parameter min_samples
min_samples = 3
# Compute distances to the nearest neighbors
distances, _ = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled).kneighbors(X_scaled)
# Plot K-Distance
plt.plot(np.sort(distances[:, min_samples-1]))
plt.title("K-Distance Plot")
plt.xlabel("Points sorted by distance")
plt.ylabel(f"Distance to {min_samples}th nearest neighbor")
plt.show()
From the K-Distance Plot, the value 0.15 was chosen at the elbow point, where the slope of the curve increases sharply.
eps_optimal = 0.15
sleman['DBSCAN_Cluster'] = DBSCAN(eps=eps_optimal, min_samples=min_samples).fit_predict(X_scaled)
After applying DBSCAN, we map each cluster label to a descriptive risk level.
risk_map_dbscan = {-1:'Noise', 0:'Low', 1:'Moderate', 2:'High'}
sleman['DBSCAN_Risk'] = sleman['DBSCAN_Cluster'].map(risk_map_dbscan)
We track DBSCAN cluster labels by year for each district to see movement across risk levels. To prepare the data, we "unpivot" the dataframe from wide format to long format.
# Create a pivot table
trend_dbscan = sleman.pivot_table(index='Kecamatan', columns='Tahun', values='DBSCAN_Cluster').reset_index()
# Convert to long format
trend_long = trend_dbscan.melt(id_vars="Kecamatan", var_name="Year", value_name="Cluster")
trend_long['RiskLevel'] = trend_long['Cluster'].map(risk_map_dbscan)
# Create line chart
fig = px.line(trend_long, x='Year', y='Cluster', color='Kecamatan', markers=True, hover_data=['RiskLevel'],
labels={'Kecamatan':'District'})
fig.update_yaxes(tickvals=[-1, 0, 1, 2], title="Cluster (DBSCAN)")
fig.update_layout(title="Wind Disaster Risk Trends")
fig.show()
By visualizing year-over-year trends, we observed that districts such as Tempel, Berbah, Ngemplak, and Kalasan rose from Low to Moderate risk, highlighting areas needing closer monitoring. Godean and Sleman dropped to Noise, indicating minimal or inconsistent risk, while Prambanan and Depok reduced from High to Low, showing effective mitigation. This demonstrates that wind disaster risk is dynamic, and adaptive strategies are essential to focus on rising-risk districts while keeping an eye on outliers.
We merged the Sleman district shapefile with data Sleman per year.
df_gdf = kecamatan.merge(sleman, left_on='KECAMATAN', right_on='Kecamatan', how='left')
Then we calculated a total feature score for each district by summing key indicators.
# Calculate total feature
df_gdf['Total_Feat'] = df_gdf[features].sum(axis=1)
# Calculate persentase Total_Feat
trend_total = df_gdf.pivot_table(index='Kecamatan', columns='Tahun', values='Total_Feat')
trend_total['Pct_Change'] = ((trend_total[2021]-trend_total[2020])/trend_total[2020]*100).round(1)
# Merge Pct_Change
df_gdf = df_gdf.merge(trend_total['Pct_Change'], left_on='Kecamatan', right_index=True)
df_gdf['Risk_Change'] = df_gdf.groupby('Kecamatan')['KMeans_Risk'].transform(lambda x: x.iloc[-1]!=x.iloc[0])
We visualized wind disaster risk in Sleman districts for 2020–2021 using an interactive map. Districts are color-coded by K-Means and DBSCAN clusters, reflecting different risk levels per method. Users can toggle layers and hover over districts to track year-over-year changes, spot hotspots needing mitigation, and recognize areas that remain stable or resilient.
def plot_multi_year_highlight_map(gdf, years):
# Colormap total features
colormap_total = cm.LinearColormap(['green','yellow','red'], vmin=gdf['Total_Feat'].min(), vmax=gdf['Total_Feat'].max(),
caption='Total Feature Count')
# Default color per method
colors_dict = {
'KMeans': {'Low':'#00FFFF','Moderate':'#F77F00','High':'#D62828'},
'DBSCAN': {'Noise':'#A9A9A9','Low':'#008000','Moderate':'#FFA500','High':'#FF0000'}}
# Base map
m = folium.Map(location=[-7.7,110.4], zoom_start=11, tiles="cartodbpositron")
for year in years:
gdf_year = gdf[gdf['Tahun']==year]
for method in ['KMeans','DBSCAN']:
fg = folium.FeatureGroup(name=f'{method} {year}')
# Add GeoJson
folium.GeoJson(gdf_year,
style_function=lambda f, method=method, colors=colors_dict[method]: {
'fillColor': colors.get(f['properties'].get(f"{method}_Risk"), 'Low'),
'color':'black',
'weight': 3 if f['properties'].get(f"{method}_Risk") in ['Low','Noise'] else 1,
'fillOpacity':0.7},
tooltip=folium.GeoJsonTooltip(
fields=['Kecamatan', f'{method}_Risk','BanyakKejadian','PohonTumbang','JaringanListrik','Total_Feat','Pct_Change'],
aliases=['Kecamatan','Risk Level','Banyak Kejadian','Pohon Tumbang','Jaringan Listrik','Total Features','% Change 2020→2021'])
).add_to(fg)
fg.add_to(m)
# Add colormap & layer control
colormap_total.add_to(m)
folium.LayerControl().add_to(m)
return m
map_highlight = plot_multi_year_highlight_map(df_gdf, [2020, 2021])
map_highlight
Calculate percentage of districts that escalated to High risk in 2021.
pct_high_increase = round(
trend_long[(trend_long['Year'] == 2021) & (trend_long['RiskLevel'] == 'High')].shape[0]
/ trend_long['Kecamatan'].nunique() * 100, 1)
print(f"Out of {trend_long['Kecamatan'].nunique()} districts, {pct_high_increase}% escalated to High risk in 2021.")
Out of 17 districts, 0.0% escalated to High risk in 2021.
Out of 17 districts, 0.0% escalated to High risk in 2021, confirming that mitigation efforts successfully prevented new high-risk zones.
Let's create a stacked bar to see distribution of risk levels across districts for each year.
# Create summary of district counts per RiskLevel for each year
trend_summary = trend_long.groupby(['Year', 'RiskLevel']).size().reset_index(name='Count')
# Create stacked bar
fig = px.bar(trend_summary, x='Year', y='Count', color='RiskLevel',
color_discrete_map={'Low':'#2A9D8F','Moderate':'#F77F00','High':'#D62828','Noise':'grey'},
title="Wind Disaster Risk Distribution each Year", labels={'Count':'Number of Districts'},)
fig.update_layout(barmode='stack', xaxis=dict(tickmode='linear'), yaxis_title='Number of Districts',
legend_title_text='Risk Level', template='plotly_white', font=dict(size=12))
fig.show()
The chart clearly shows that High-risk districts decreased or remained stable, while Moderate-risk districts need ongoing monitoring. Low-risk districts maintained safety, demonstrating resilience. Noise segments reveal outliers or districts with irregular patterns that require targeted investigation. This visualization provides stakeholders a year-over-year perspective to prioritize mitigation and allocate resources effectively.
The wind disaster risk analysis in Sleman Regency reveals a dynamic landscape of resilience and vulnerability. High-risk districts (“red zones”) like Cangkringan demand urgent mitigation, while moderate-risk areas act as early-warning zones requiring close monitoring. Low-risk districts remain relatively safe but must still be observed, especially where outlier patterns appear. Year-over-year trends from 2020 to 2021 show no new districts escalated to high risk, highlighting the effectiveness of ongoing mitigation. By combining K-Means and DBSCAN clustering, along with interactive maps and trend visualizations, this analysis provides actionable insights to prioritize interventions, allocate resources efficiently, and strengthen community preparedness.